Delay Scheduling Based Replication Scheme for Hadoop Distributed File System
Abstract
The volume of data generated and processed by modern computing systems is growing rapidly. MapReduce is an important programming model for large-scale data-intensive applications, and Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). Its scalability and fault tolerance have made Hadoop a de facto standard for Big Data processing. Hadoop stores data in the Hadoop Distributed File System (HDFS), which achieves data reliability and fault tolerance through replication. In this paper, a new technique called the Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (de-replicate) the popular (unpopular) files/blocks in HDFS, based on information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.
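The core idea above can be illustrated with a small sketch. This is not the paper's algorithm; it is a hypothetical popularity tracker in the spirit of DSBRA, where a block whose tasks are repeatedly delayed while waiting for a node-local slot is treated as popular (under-replicated), and a block that goes unaccessed is a candidate for de-replication. All class names, thresholds, and the default replica count of 3 are illustrative assumptions.

```python
from collections import defaultdict

# Assumed, illustrative thresholds -- not values from the paper.
REPLICATE_THRESHOLD = 5    # delay events before adding a replica
DEREPLICATE_THRESHOLD = 0  # access count at or below which a replica is dropped
MAX_REPLICAS = 5
MIN_REPLICAS = 1

class ReplicationAdvisor:
    """Hypothetical advisor fed by scheduler events (a sketch, not HDFS code)."""

    def __init__(self):
        self.delay_events = defaultdict(int)    # block -> delayed launch attempts
        self.access_counts = defaultdict(int)   # block -> accesses this window
        self.replicas = defaultdict(lambda: 3)  # block -> current replica count

    def record_delay(self, block):
        # The scheduler could not place a node-local task for this block,
        # so it delayed the launch; repeated delays suggest a hotspot.
        self.delay_events[block] += 1

    def record_access(self, block):
        self.access_counts[block] += 1

    def decide(self, block):
        """Return 'replicate', 'dereplicate', or 'keep' for one block."""
        if (self.delay_events[block] >= REPLICATE_THRESHOLD
                and self.replicas[block] < MAX_REPLICAS):
            self.replicas[block] += 1
            self.delay_events[block] = 0  # reset after acting
            return "replicate"
        if (self.access_counts[block] <= DEREPLICATE_THRESHOLD
                and self.replicas[block] > MIN_REPLICAS):
            self.replicas[block] -= 1
            return "dereplicate"
        return "keep"
```

In a real deployment the `decide` step would translate into calls to HDFS's replication machinery (e.g. adjusting a file's replication factor), but the decision logic itself is the part sketched here.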
Similar resources
Efficient Data Replication Scheme based on Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is designed to store huge data sets reliably and has been widely used for processing massive-scale data in parallel. In HDFS, the data locality problem is one of the critical problems that degrade file system performance. To address it, we propose an efficient data replication scheme based on access count prediction in a Hadoop...
Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop
Hadoop, an open-source implementation of the MapReduce framework, has been widely used for processing massive-scale data in parallel. Since Hadoop uses a distributed file system, called HDFS, the data locality problem often arises (i.e., a data block must be copied to the processing node when that node does not hold the block in its local storage), and this problem leads to t...
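The access-count-prediction idea mentioned above can be sketched in a few lines. This is not the cited paper's method; it is a generic, hedged illustration using exponential smoothing to predict a block's next-window access count and a simple heuristic to map that prediction to a replica count. The smoothing factor and the mapping parameters are assumptions.

```python
# Illustrative smoothing factor; the cited work may use a different predictor.
ALPHA = 0.5

def predict_next(history, alpha=ALPHA):
    """Exponentially smoothed prediction of a block's next access count.

    `history` is a list of per-window access counts, oldest first.
    """
    if not history:
        return 0.0
    estimate = float(history[0])
    for count in history[1:]:
        estimate = alpha * count + (1 - alpha) * estimate
    return estimate

def replicas_for(predicted, base=3, per_accesses=10, max_replicas=5):
    """Heuristic: one extra replica per `per_accesses` predicted accesses,
    starting from the default replication factor `base` (all assumed)."""
    return min(max_replicas, max(1, base + int(predicted // per_accesses)))
```

A scheme like this would run periodically, re-predicting per-block demand and raising or lowering replica counts ahead of the next processing wave.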
A Comparative Analysis of MapReduce Scheduling Algorithms for Hadoop
Today's digital era is producing ever-larger datasets. These datasets are termed "Big Data" because of their massive volume, variety, and velocity, and they are stored in distributed file system architectures. Hadoop is a framework that provides the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing large data sets in a distributed computing environment. Task assignment is pos...
Performance Evaluation of Stream Log Collection Using HADOOP Distributed File System
Recently, stream logging has been widely adopted by web-based and product-based companies. Stream logging is one of the most important items on the agenda of business re-engineering, which is carried out to improve the effectiveness and productivity of a particular product or service. Stream logging is achieved at minimum cost using a transaction-based model over a distribu...
An Experimental Evaluation of Performance of A Hadoop Cluster on Replica Management
Hadoop is an open-source implementation of the MapReduce framework in the realm of distributed processing. A Hadoop cluster is a special type of computational cluster designed for storing and analyzing large datasets across a cluster of workstations. To handle massive-scale data, Hadoop exploits the Hadoop Distributed File System (HDFS). HDFS, like most distributed file systems, sh...